Skip to content

perf(dedupe): skip unchanged rows, VALUES fast-write, prefetch vulnerability_ids#15046

Open
valentijnscholten wants to merge 3 commits into
DefectDojo:devfrom
valentijnscholten:perf/dedupe-skip-unchanged-and-values-write
Open

perf(dedupe): skip unchanged rows, VALUES fast-write, prefetch vulnerability_ids#15046
valentijnscholten wants to merge 3 commits into
DefectDojo:devfrom
valentijnscholten:perf/dedupe-skip-unchanged-and-values-write

Conversation

@valentijnscholten

@valentijnscholten valentijnscholten commented Jun 19, 2026

Copy link
Copy Markdown
Member

What

Performance optimizations for the hash-recompute / dedupe write path in mass_model_updater and Finding.get_vulnerability_ids.

The dedupe recalculation command performance relies heavily on bulk_update. Turns out this is performing a very expensive part of code to calculate a bit CASE WHEN .... clause. Because this is a hot path and the dedupe only updates 1 (or 3) hash code fields, I feel it's OK to use raw SQL here as it saves huge amounts of time.

mass_model_updater (dojo/utils.py)

  • skip-unchanged (skip_unchanged=True default): rows whose tracked fields weren't changed by the update function are no longer written. Re-running over already-correct data now issues zero UPDATEs. The old value is read from the page query (via __dict__, so deferred fields don't trigger an extra query).
  • writer hook: optional writer(model_type, batch, fields) callable to replace bulk_update for persisting a batch; defaults to bulk_update.

VALUES fast-write (dojo/finding/deduplication.py)

  • hashcode_values_writer writes the text hash columns with a single UPDATE … FROM (VALUES …) join instead of bulk_update's per-row CASE/WHEN expression tree (which is O(rows×fields) Python nodes to build/resolve and dominates wide-batch writes). PostgreSQL only; falls back to bulk_update on other backends. Values are bound as parameters.

Setting: DD_MASS_HASH_CODE_USE_SQL_WRITER (dojo/settings/settings.dist.py)

  • New env-backed setting (MASS_HASH_CODE_USE_SQL_WRITER, default True). When True, the dedupe hash_code pass uses the PostgreSQL VALUES-join fast writer; set False to always fall back to bulk_update (escape hatch if the SQL writer misbehaves on a given deployment). The default writer was also extracted into default_mass_model_writer so the bulk_update path is a named callable.

vulnerability_ids N+1 (dojo/models.py)

  • get_vulnerability_ids now reads finding.vulnerability_id_set.all() (prefetch-aware) instead of a fresh Vulnerability_Id.objects.filter(...) + .count(). For parsers whose hash config includes vulnerability_ids this was a per-finding COUNT + SELECT N+1.

dedupe command (dojo/management/commands/dedupe.py)

  • prefetch vulnerability_id_set; use the VALUES writer for the hash pass.

Why

The dedupe/recompute write path scales poorly on large finding sets: bulk_update builds a per-row CASE/WHEN for the updated columns (slow to build/resolve for wide batches), it writes every row even when the value is unchanged, and get_vulnerability_ids issued a query per finding for affected parsers. These changes remove all three: unchanged rows aren't written, changed rows use a VALUES join, and the vuln-id lookup honors prefetch.

Concretely, on a ~15k-finding all-changed batch the VALUES join avoids ~14s of Python spent building and resolving bulk_update's CASE/WHEN expression tree (the resolve_expression cost in profiling). That figure is purely the write mechanism — independent of any caching. The skip-unchanged and prefetch wins on top are workload-dependent (driven by how many rows actually changed and whether the parser hashes on vulnerability_ids).

Tests

unittests/test_mass_model_updater.py — skip-unchanged, writer hook, VALUES writer (incl NULL handling), fields=None side-effect mode. All green on PostgreSQL.

Compatibility

Backward-compatible: skip_unchanged only skips no-op writes; writer defaults to bulk_update; the VALUES path is Postgres-gated with a bulk_update fallback, and is additionally gated behind DD_MASS_HASH_CODE_USE_SQL_WRITER (default True) so it can be disabled per deployment without a code change.

…lnerability_ids

mass_model_updater optimizations for the hash-recompute / dedupe write path:

- skip_unchanged (default True): rows whose tracked fields were not changed by
  the update function are no longer written. Re-runs over already-correct data
  issue zero UPDATEs. Compared against values loaded by the page query (read from
  __dict__ so deferred fields don't trigger an extra query).
- writer hook: optional `writer(model_type, batch, fields)` callable to replace
  Django's bulk_update for a batch. Defaults to bulk_update.
- hashcode_values_writer (dojo/finding/deduplication.py): writes the text hash
  columns with a single `UPDATE ... FROM (VALUES ...)` join instead of
  bulk_update's per-row CASE/WHEN expression tree, which dominates wide-batch
  updates. PostgreSQL only; falls back to bulk_update otherwise. Values are bound
  as parameters.
- Finding.get_vulnerability_ids now reads the prefetch-aware reverse relation
  (finding.vulnerability_id_set.all()) instead of a fresh
  Vulnerability_Id.objects.filter(...) + .count(), so prefetch_related is honored
  (was a per-finding N+1 for parsers whose hash config includes vulnerability_ids).
- dedupe command: prefetch vulnerability_id_set and use the VALUES writer for the
  hash pass.

Adds unittests/test_mass_model_updater.py covering skip-unchanged, the writer
hook, the VALUES writer (incl NULL handling) and fields=None side-effect mode.
@valentijnscholten valentijnscholten modified the milestones: 3.0.100, 3.1.0 Jun 19, 2026
The new TestMassModelUpdater hardcoded fixtures = ["dojo_testdata.json"],
which fails under V3_FEATURE_LOCATIONS=True: loading the legacy Endpoint
rows trips the model's V3 deprecation guard during tagulous retag. Decorate
with @versioned_fixtures (the established repo pattern) so it loads
dojo_testdata_locations.json under V3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@valentijnscholten valentijnscholten marked this pull request as ready for review June 21, 2026 06:49
@valentijnscholten valentijnscholten added the affects_pro PRs that affect Pro and need a coordinated release/merge moment. label Jun 21, 2026

@mtesauro mtesauro left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved

…ting

- Extract bulk_update into default_mass_model_writer so the default
  mass_model_updater write path is a named callable.
- Add DD_MASS_HASH_CODE_USE_SQL_WRITER (default True). When False, the
  dedupe hash_code pass falls back to bulk_update instead of the
  PostgreSQL VALUES-join writer.
@github-actions github-actions Bot added the settings_changes Needs changes to settings.py based on changes in settings.dist.py included in this PR label Jun 21, 2026
@Maffooch Maffooch requested review from blakeaowens and dogboat June 23, 2026 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

affects_pro PRs that affect Pro and need a coordinated release/merge moment. settings_changes Needs changes to settings.py based on changes in settings.dist.py included in this PR unittests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants